Fully automated determination of the cervical vertebrae maturation stages using deep learning with directional filters

Introduction We aim to apply deep learning to achieve fully automated detection and classification of the Cervical Vertebrae Maturation (CVM) stages. We propose an innovative custom-designed deep Convolutional Neural Network (CNN) with a built-in set of novel directional filters that highlight the edges of the Cervical Vertebrae in X-ray images. Methods A total of 1018 Cephalometric radiographs were labeled and classified according to the Cervical Vertebrae Maturation (CVM) stages. The images were cropped to extract the cervical vertebrae using an Aggregate Channel Features (ACF) object detector. The resulting images were used to train four different Deep Learning (DL) models: our proposed CNN, MobileNetV2, ResNet101, and Xception, together with a set of tunable directional edge enhancers. When using MobileNetV2, ResNet101 and Xception, data augmentation is adopted to allow adequate network complexity while avoiding overfitting. The performance of our CNN model was compared with that of MobileNetV2, ResNet101 and Xception with and without the use of directional filters. For validation and performance assessment, k-fold cross-validation, ROC curves, and p-values were used. Results The proposed innovative model that uses a CNN preceded by a layer of tunable directional filters achieved a validation accuracy of 84.63% in CVM stage classification into five classes, exceeding the accuracy achieved with the other DL models investigated. MobileNetV2, ResNet101 and Xception used with directional filters attained accuracies of 78.54%, 74.10%, and 80.86%, respectively. The custom-designed CNN method also achieves 75.11% in six-class CVM stage classification. The effectiveness of the directional filters is reflected in the improved performance attained in the results. If the custom-designed CNN is used without the directional filters, the test accuracy decreases to 80.75%. 
In the Xception model without the directional filters, the testing accuracy drops slightly to 79.42% in the five-class CVM stage classification. Conclusion The proposed model of a custom-designed CNN together with the tunable Directional Filters (CNNDF) is observed to provide higher accuracy than the commonly used pre-trained network models that we investigated in the fully automated determination of the CVM stages.

The key original contributions of this paper are as follows:
• The custom-designed convolutional neural network (CNNDF) incorporates directional filters as an initial layer to emphasize the edges of Cervical Vertebrae.
• The convolutional directional filter layer parameters are also updated during the training process.
• The CNNDF has fewer parameters than the pre-trained networks (MobileNetV2, ResNet101 and Xception) and it does not require data augmentation.
• The directional filter layer improves the classification accuracy of CVM stages not only in the proposed CNNDF but also in pre-trained networks and the classical SVM method.

II. Material and methods
The data set used in developing our algorithm consists of digitized images of scanned lateral cephalometric films for subjects aged between 4 and 29 obtained from the American Association of Orthodontists Foundation (AAOF) Craniofacial Growth Legacy Collections, an open data source. Films from Michigan, Bolton Brush, and Iowa Facial Growth Studies were used in this study [22]. The AAOF legacy collection was collected from growth studies, so they are longitudinal records. Since we used an open-source dataset, the present study was granted IRB exempt status (2021-0480) by the Office of Human Subjects Protection at The University of Illinois at Chicago. All the films were downloaded as jpg files using Firefox FTP software (Mountain View, Calif) and they do not require any special access privileges to access. All cephalometric radiographs included in this study are of adequate quality, with the second (C2), third (C3), and fourth (C4) cervical vertebrae clearly observable. The exclusion criteria were limited to poor-quality images or malformations in the head and neck region. The images were studied and labeled by an expert Orthodontist Scientist (MHE) with more than ten years of experience in classifying CVM. Cervical maturation stages were classified into six stages (CS1-CS6) in keeping with the methodology from previous studies [2,23]. Cervical Stage 1 (CS1): the inferior borders of vertebral bodies C2 to C4 are flat, and the third and fourth cervical bodies are trapezoidal. Cervical Stage 2 (CS2): visible notch along the inferior border of the second cervical vertebra, while the lower borders of the third and fourth vertebral bodies remain flat with both C3 and C4 retaining a trapezoidal shape. Cervical Stage 3 (CS3): visible notching of the inferior borders of C2 and C3; the inferior border of C4 remains flat while the C3 and C4 bodies still retain a trapezoidal shape. 
Cervical Stage 4 (CS4): the inferior borders of vertebral bodies C2 to C4 have obvious concavities along their inferior surfaces, with the shapes of C3 and C4 bodies being horizontally elongated rectangular rather than trapezoidal. Cervical Stage 5 (CS5): differentiated from CS 4 based on the shapes of C3 and/or C4, with these body shapes becoming square. All three cervical bodies have notches. Cervical Stage 6 (CS6): at least one of the third and fourth cervical bodies has assumed a vertically elongated rectangular morphology. The length of the posterior border is longer than the inferior border. Besides, the cortical bone appears better delineated at CS 6 than at CS 5.
The main data set was classified into six stages of CVM (CS1-CS6) [2,12,23]. Another improved method of the CVM assessment introduced by Baccetti et al. [24] classified the CVM into five stages (CVMS I-CVMS V). In the five-stage classification, the CS1 and CS2 stages were merged into a single stage referred to as CVMS I (the lower borders of all the three vertebrae are flat, with the possible exception of a concavity at the lower border of C2).
Furthermore, CS3 is denoted as CVMS II. We assessed our model on both classification problems: five CVM stages (CVMS I-CVMS V) and six CVM stages (CS1-CS6).
The principal evaluator (MHE) repeated the classification process two weeks later, and the intra-examiner reproducibility for the cervical vertebral stages was tested by weighted kappa (wk). The intra-examiner agreement was almost perfect (wk = 0.95). Furthermore, another evaluator, an Oral and Maxillofacial surgeon (OS), performed the repeated classification process, and the inter-examiner agreement was strong (wk = 0.90). The final data set of 1018 classified images by the principal evaluator (MHE) was used in this study based on the MHE's greater experience. In our data set, the number of lateral cephalograms belonging to cervical stages CS1, CS2, CS3, CS4, CS5, and CS6 were 154, 187, 174, 159, 167, and 177, respectively. In CVMS classification, CS1 and CS2 classes are merged to form the CVMS I class. CVMS II is the same as CS3, CVMS III is the same as CS4, and so on.
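The relabeling from six stages to five can be sketched as a simple lookup. The snippet below is only an illustration of the merging rule stated above, using the class counts reported in the text; the names `CS_COUNTS`, `CS_TO_CVMS`, and `merge_counts` are hypothetical and not taken from the authors' code.

```python
# Six-stage (CS1-CS6) image counts as reported in the text.
CS_COUNTS = {"CS1": 154, "CS2": 187, "CS3": 174,
             "CS4": 159, "CS5": 167, "CS6": 177}

# CS1 and CS2 merge into CVMS I; every later stage shifts down by one
# (CS3 -> CVMS II, CS4 -> CVMS III, and so on).
CS_TO_CVMS = {
    "CS1": "CVMS I", "CS2": "CVMS I", "CS3": "CVMS II",
    "CS4": "CVMS III", "CS5": "CVMS IV", "CS6": "CVMS V",
}


def merge_counts(cs_counts):
    """Aggregate six-stage counts into five-stage counts."""
    merged = {}
    for cs_label, count in cs_counts.items():
        cvms_label = CS_TO_CVMS[cs_label]
        merged[cvms_label] = merged.get(cvms_label, 0) + count
    return merged


CVMS_COUNTS = merge_counts(CS_COUNTS)
```

Under this mapping the six counts sum to 1018, and CVMS I becomes the largest class with 154 + 187 = 341 images, which is worth keeping in mind when comparing per-class results between the two problems.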
Data Augmentation for Pre-trained Networks: We note that our proposed CNN with directional filter layer does not need data augmentation during training. However, pre-trained deep learning algorithms such as Xception usually require huge training data sets to produce accurate classification results. If the size of the training data set is small, then the Xception model may overfit the training data and achieve high accuracy on the training data set while performing poorly on validation and testing data sets. There are several solutions to avoid overfitting the model to the training data set. These include transfer learning, batch normalization, and data augmentation. Since our data set is not very large, we use transfer learning together with data augmentation, which increases the number of training images, to prevent overfitting in the case of the pre-trained networks MobileNetV2, ResNet101, and Xception. To augment the data, new images are created by shifting the RoI images to the right, left, up and down, and by rotating them clockwise and counterclockwise. Our CVM image data set has 1018 images. Of these 1018 images, 761 are used in the training set and 257 in the testing set. These 257 testing images are set aside for the evaluation of the model at the end. By implementing the augmentation methods, we created 10 additional images out of each image in the training set. A total of 761×11 = 8371 images is generated using augmentation for training. The density distribution of the classes is preserved after the augmentation. The augmented data set is used as a training set only for the pre-trained deep learning algorithms. Again, we emphasize that the proposed CNN with directional filter layer does not need data augmentation during training.
The data set contains labeled images, a large subset of which serves as the training set needed to develop a deep learning model. Since the edges of the vertebral body shapes are critical in determining the CVM stage, we decided to investigate a custom-designed innovative deep learning model that includes tunable preprocessing to emphasize the edges in the input image. Our deep learning architecture is based on the Convolutional Neural Network (CNN), a classifier widely used in image recognition applications. To prepare our data set for training, the images are first segmented to extract the region of interest. The deep learning architecture is augmented with a bank of directional filters aimed at improving the accuracy of the classification task. The results attest to the improvement in accuracy due to the use of the proposed augmented architecture.
In the following subsections, we describe the steps of our lateral cephalogram classification algorithm. We first segment the images to extract the regions of interest using an aggregate channel feature object detector as described in Subsection II-A. The directional filters used in this study are described in Subsection II-B; and finally, the custom designed CNN model is delineated in Subsection II-C.

A. Image RoI segmentation
We first segment the image and identify the spine (Cervical Vertebrae) region using the so-called Aggregate Channel Features (ACF) object detector [25]. It is a classical computer vision method that analyzes a given image in sliding windows using image feature pyramids. This avoids the process of manually cropping the spine region in each image in the database. The use of the ACF object detector helps automate the process of preparing images for subsequent analysis with the deep learning algorithm, thereby making the overall processing computationally more efficient. As a result, the skull, jaw, and irrelevant background regions are removed before the images are applied to the deep learning algorithm. The ACF object detector automatically extracts the Region of Interest (RoI) in the images, thereby reducing the search space of the deep learning structure.
We trained the ACF object detector using 300 images and applied it to extract relevant RoI sub-images from the 1018 larger images in our data set. The ACF detector correctly identified the RoIs in 703 out of the remaining 718 images, yielding an accuracy of 98%. In the remaining 15 images, we manually cropped the ROI before feeding it to our proposed CNNDF model and pre-trained MobileNetV2, ResNet101, and Xception models. Because all the segmented images have variable size, they are resized to a common size of 77x35. We note that resizing the images can be a sensitive operation because it might change the relative dimensions of the vertebrae. This may impact classification as the fundamental difference between CVM stages is the size and the curvature of the vertebrae in X-ray images. Therefore, the resizing should be done without changing the aspect ratio of the image. Next, instead of feeding the resized ROI images directly to the DL networks, the edges of vertebral bodies in the ROI images are emphasized using eight directional filters described in [26,27]. The outputs of directional filters are fed to the DL networks. In fact, we make the directional filters the first layer of the DL networks and even update their parameters during the network training process making the layer parameters tunable.
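One way to honor the aspect-ratio constraint when bringing variable-size RoIs to a common size is to scale uniformly and pad the remainder. The sketch below only computes the geometry. It assumes that 77 is the height and 35 the width, and that padding fills the leftover space; neither detail is stated in the text, and `fit_with_padding` is a hypothetical helper, not the authors' procedure.

```python
def fit_with_padding(src_h, src_w, dst_h=77, dst_w=35):
    """Uniformly scale (src_h, src_w) to fit inside (dst_h, dst_w).

    Returns (new_h, new_w, pad_top, pad_bottom, pad_left, pad_right).
    ASSUMPTION: 77 is the height and 35 the width of the target size,
    and the unused area is filled by symmetric padding.
    """
    if src_h <= 0 or src_w <= 0:
        raise ValueError("source dimensions must be positive")
    # A single scale factor for both axes keeps the aspect ratio intact.
    scale = min(dst_h / src_h, dst_w / src_w)
    new_h = max(1, min(dst_h, round(src_h * scale)))
    new_w = max(1, min(dst_w, round(src_w * scale)))
    pad_h = dst_h - new_h
    pad_w = dst_w - new_w
    return (new_h, new_w,
            pad_h // 2, pad_h - pad_h // 2,
            pad_w // 2, pad_w - pad_w // 2)


# A 200x50 crop is narrower than the target shape, so it is padded left/right.
geometry = fit_with_padding(200, 50)
```

For a 200×50 crop this yields a 77×19 scaled image with 8 pixels of padding on each side, so the relative dimensions of the vertebrae are unchanged.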
We review the directional filters in the next subsection.

B. Directional filters
Deep learning structures contain many convolutional filters whose weights are learned during training. These convolutional filters are generally initialized with random numbers. In this section, we introduce an initial layer consisting of eight directional filters. We initialize these filters in such a way that they highlight the edges of X-ray images in eight directions. We design the initial values of the filters using a Fourier domain method. During the standard learning process, we also allow the network to update the values of the front-end layer filters. Outputs of directional filters provide multi-channel inputs to the rest of the deep CNN. Since we have eight directional filters, we have more input channels than the commonly used three input channels. In the pre-trained models we investigated, we first use a stage of directional filtering, following which we adhere to the available model of the pre-trained network. In designing the directional filters, we focused on using simple low-order filters that work adequately for the task. Higher-order filters produce better frequency characteristics but may cause ringing artifacts at edges in an image. To design the directional filters, we used a one-dimensional high-pass prototype filter obtained from a 7-th order half-band Lagrangian maximally flat lowpass filter [28,29] whose transfer function $H_{lp}(z)$ has a DC gain of $H_{lp}(e^{j0}) = H_{lp}(1) = 2$.
The high-pass filter is obtained with the transformation $H_{hp}(z) = H_{lp}(-z)$, which gives the high-pass impulse response $h_{hp}[n]$. This filter is concatenated with a lowpass filter with impulse response $h_c[n] = \{1, 2, 1\}$ to avoid the amplification of noise at high frequencies, and the overall frequency response is that of a band-pass filter: $h_{bp}[n] = h_{hp}[n] * h_c[n]$, where $*$ represents the convolution operation. The frequency response $H_{bp}(e^{j\omega})$ of the filter with impulse response $h_{bp}$ is shown in Fig 1A. The prototype (horizontal) two-dimensional (2-D) filter, with impulse response $h_{0^\circ}$, is obtained from $h_{bp}$. The 2-D impulse response obtained by rotating $h_{0^\circ}$ by an angle of θ degrees is denoted by $h_{\theta^\circ}$. The filter with impulse response $h_{0^\circ}$ produces zero-crossings at the location of the vertical edges. We rotate the impulse response $h_{0^\circ}$ by θ = −63.43°, −45°, −26.56°, 0°, 26.56°, 45°, 63.43°, 90° to obtain eight filters. Due to our choice of low-order filters, the lower density of samples in the diagonal direction produces multi-band responses in the case of θ = 45° rather than the bandpass responses observed in the case of θ = 0° or 90°. The frequency response for θ = 90° is shown in Fig 1B. For example, the impulse response of the filter corresponding to 45° produces zero-crossings at 45° edges. The coefficients of the eight filters are provided in the Supporting information section. The frequency responses of the 2-D filters are also shown in the Supporting information section. The coefficients of the directional filters are obtained using the method described in [26]. Instead of rotating the coefficients of the horizontal filter $h_{0^\circ}$ using bilinear interpolation, the authors of [26] propose a method that results in a sparse set of filter coefficients whose sums along the directions θ = {±26.56° ± 90°, ±45° ± 90°, ±63.43° ± 90°} are approximately equal to the coefficients of the horizontal filter. The filter with θ = 90° is simply the vertical version of the horizontal filter. 
The 2-D filter impulse responses corresponding to θ = {0˚, ±26.56˚, ±45˚, ±63.43˚, 90˚} are also available in [26].
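The one-dimensional part of this design can be checked numerically. The sketch below derives a high-pass filter from a lowpass prototype via $H_{hp}(z) = H_{lp}(-z)$ and convolves it with {1, 2, 1}. The lowpass coefficients are an assumption: they are the standard 7-tap Lagrange half-band values scaled so that the DC gain is 2, which is consistent with the description above but is not quoted from it. The function names are hypothetical.

```python
from fractions import Fraction as F

# ASSUMPTION: standard 7-tap Lagrange half-band lowpass, scaled to DC gain 2.
# These values are not given in the text above.
h_lp = [F(-1, 16), F(0), F(9, 16), F(1), F(9, 16), F(0), F(-1, 16)]


def highpass_from_lowpass(h):
    """Apply H_hp(z) = H_lp(-z): tap n is multiplied by (-1)^n,
    with n counted from the center tap of the symmetric filter."""
    center = len(h) // 2
    return [c * (-1) ** abs(n - center) for n, c in enumerate(h)]


def convolve(a, b):
    """Plain 1-D linear convolution of two coefficient lists."""
    out = [F(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


h_hp = highpass_from_lowpass(h_lp)
h_c = [F(1), F(2), F(1)]        # smoothing lowpass from the text
h_bp = convolve(h_hp, h_c)      # band-pass prototype, 9 taps

# Gains at DC (omega = 0) and at omega = pi.
dc_gain = sum(h_bp)
pi_gain = sum(c * (-1) ** k for k, c in enumerate(h_bp))
```

With these assumed coefficients, both `dc_gain` and `pi_gain` evaluate to zero, which matches the band-pass behavior described for $h_{bp}$: the high-pass stage removes DC and the {1, 2, 1} stage removes the highest frequency.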
The motivation behind using the directional filters is to start the deep learning model from our prior domain knowledge. Since the images in different classes do not differ as dramatically as natural images do, we expect that highlighting the edges of the bones in more than two directions will give an advantage to the deep network model. We illustrate the effect of the directional filters on an instance of cervical vertebrae in Fig 2. These observations motivated our use of the directional filters (DF) in the proposed method. The performance accuracy with and without directional filters is examined in Section III.
As a result, eight different channels are created at the output of the directional filters. These filters are implemented with direct convolution. Outputs of these filters serve as initial estimates in our networks. Instead of starting the initial layer with random numbers, we start the training process with the directional filters emphasizing edges at different angles.

C. Convolutional Neural Network with Directional Filter (CNNDF) layer
The most efficient and common trainable model to classify the images into a specific number of classes is the Convolutional Neural Network (CNN) [30,31]. Fundamentally, convolution layers are designed to extract the information from the images using a 2-D convolution operation. The channels created after the convolution layers can contain vital information regarding the class, and the channels are emphasized or de-emphasized accordingly. In this project, we initially decided to use a custom CNN with residual learning to measure the capacity of a DL model on our data set. More elaborate models, namely MobileNetV2 [32], ResNet101 [33], and Xception [34], are also used for the purpose of comparison. These pre-trained networks are trained on ImageNet [35] and their performance is presented in the ImageNet Challenge [36].
To determine how many layers to use, we try to strike a balance between overfitting and underfitting. It is common knowledge that there is always a chance that the model may overfit due to the data set characteristics or the complexity of the model. To achieve better balance, it is recommended to start from a simple CNN model, then increase model complexity as needed for better results. Using this strategy, 6 convolutional layers are found to be best suited for our classification task. However, due to the nature of the convolution layers, the problem of vanishing/exploding gradients might arise as the CNN gets deeper. Residual learning is introduced into the model as skip connection blocks to avoid the vanishing gradient problem [33]. In addition to the regular convolution layers, a convolution layer with no activation function and kernel size (1,1) is used to match the filter numbers in the so-called skip connection block. Batch normalization and dropout layers are key mechanisms adopted after the regular convolution layers to avoid overfitting as the model trains. The non-linear activation function "ReLU" is adopted after the batch normalization layers. The outputs of these two convolution layers are added to achieve the final output. Maximum pooling layers are used to decrease the number of parameters that the model must learn. The architecture of the proposed CNN is given in Table 1. The first layer consists of directional filters described in Section II.B. We believe that the directional filters help highlight the details and the differences of the classes through the layers.
As pointed out above, the initial layer of the CNN consists of directional filters which are initialized using the filter weights described in Section II.B. During training we also tuned the directional filters by updating their weights using the backpropagation algorithm. Instead of initializing these filters with random numbers, we initialize them with coefficients that can detect edges in different directions. In other words, we use our domain knowledge to guide the network to achieve the goal of recognizing CVM stages.
The choice of the model components discussed above defines the final structure of our model. The overview of our model is given in Fig 3. To find the optimal components, a series of deep learning models and pre-processing techniques are used. To compare the results, MobileNetV2, ResNet101, Xception, and our own fully connected CNN are used as benchmarks with various directional filter choices.

III. Results and comparisons
In this section, the results obtained and the comparison between different models with or without the pre-processing are presented. As mentioned earlier, the proposed method yields 75.11% accuracy for six classes (CS1-CS6). It achieves a higher 84.63% validation accuracy in 5-stage classification (CVMS I-CVMS V). The average accuracy attained and loss incurred in k-fold cross validation for classification into 5 stages are shown in Fig 4A and 4B, respectively. For comparison, the k-fold (k = 5) cross validation results for 6-stage classification over the epochs are depicted in Fig 5A and 5B. In this paper, 5-fold cross-validation is used to check that the model's performance does not depend on a particular split of the data. In every fold, the validation and training data sets are different, which indicates that the model works on an arbitrary held-out data set. A training set which contains 761 images is used. We choose a 4:1 splitting in the 5-fold cross validation. Images are split before the augmentation, and then the augmentation is applied. Therefore, there is no overlap between the training fold and validation fold in any part of the training process. We do not use any data augmentation during the training of the proposed CNN model with directional filters. Data augmentation is used only to train pre-trained models (MobileNetV2, ResNet101 and Xception). As seen from Figs 4 and 5, the model performs better on 5-stage classification than on 6-stage classification. This difference should be interpreted with a closer look at the two-class classification for CS1 and CS2. It is observed that the difference between images of CS1 and CS2 is the curvature in vertebrae in CS2, which is not a strong differentiator and makes the error in distinguishing CS1 and CS2 high. This is the key reason why the accuracy is lower for 6-stage classification (CS1-CS6) compared with 5-stage classification (CVMS I-CVMS V). We found that classification based on Baccetti et al.'s 5-stage CVM classification [24] provides more reliable results in all the networks that we investigated.
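The split-then-augment ordering described above can be sketched with index bookkeeping alone. This is an illustration under assumptions: the fold assignment is a plain shuffled split (the class-preserving stratification mentioned in the text is omitted for brevity), the augmented images are represented only by `(source_id, variant)` pairs, and `k_fold_indices` and `augment_ids` are hypothetical names.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_ids, val_ids) for each of k folds (roughly 4:1 for k=5)."""
    ids = list(range(n_items))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        val_ids = folds[i]
        train_ids = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_ids, val_ids


def augment_ids(train_ids, copies=10):
    """Represent each training image plus `copies` derived images as
    (source_id, variant) pairs; variant 0 is the original image."""
    return [(src, v) for src in train_ids for v in range(copies + 1)]


# 761 training images, split first; augmentation touches the training part only.
splits = list(k_fold_indices(761, k=5))
first_train, first_val = splits[0]
augmented = augment_ids(first_train)
```

Because augmentation is applied after the split, every augmented image traces back to a source in the training fold, so no shifted or rotated copy of a validation image can leak into training. Applied to all 761 training images, the same bookkeeping gives 761 × 11 = 8371 images.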
The confusion between CS1 and CS2 originates primarily but not entirely from the curvature difference between C1 and C2. The posture of the subjects also influences the decision. If an X-ray image obtained from a CS1 subject is inclined slightly downwards, there is a chance that this image is predicted as CS2 because of the posture. As a result, Baccetti et al.'s 5-stage classification is also adopted for classification in this paper. The detailed validation accuracy results of every fold are shown in Fig 6. The accuracy is consistent across the folds, which supports the validity of the model. To evaluate the model on the testing data set, the confusion matrices of both 5-class and 6-class problems are shown in Fig 7. Since the model is validated over 5 folds, 5 different confusion matrices are obtained per classification and averaged. The average confusion matrix of 5-stage classification is shown in Fig 7A. For example, we had 39 CVMS I images in our test set. We correctly predicted 35.6 of these images (the non-integer value is due to averaging over 5 experiments); 2.4 of the 39 images are classified as CVMS II, 0.6 as CVMS III, and 0.4 as CVMS IV. Other rows of the matrices are obtained in a similar manner. In the case of 6-stage classification, CS1 and CS2 classes have the lowest accuracy as shown in the first two rows of the confusion matrix in Fig 7B. We observe that the proposed model makes most of the misclassifications between neighboring classes.
To further evaluate the proposed method and possibly improve the model, we compare our method with other pre-trained models. MobileNetV2 [32], ResNet101 [33] and Xception [34] networks are used for comparison. In addition, a classical approach, Support-Vector Machine (SVM) with a Radial Basis Function (RBF) kernel is used for comparison.
To see the effect of using directional filters, we first implemented all the models without the Directional Filters (DF). As shown in Table 2, the accuracy of the model that uses the CNNDF is higher than that of models that do not use DFs in both 5-class and 6-class classifications, which emphasizes the impact of the directional filters. The best result, 84.63% accuracy, is obtained in the 5-stage classification problem with our custom CNNDF model. ResNet101 and MobileNetV2 with the 8 directional filters do not reach even 80% accuracy, achieving 74.10% and 78.54%, respectively, in 5-stage CVM classification. The well-known Xception network [34] derived from InceptionV3 [37] yields 80.86% accuracy, with performance closest to our CNNDF. Our CNNDF outperforms the other well-known image classification networks in terms of the accuracy of the classification results. The pre-trained models are developed to classify natural scene images and objects; their use in medical image analysis may not be as effective as in classifying natural scenes. The other significant result from Table 2 is the classification result for 6-stage classification. The best performance is obtained when our CNNDF is used (75.11% accuracy). Moreover, our CNNDF has only 715 K parameters. MobileNetV2, the simplest algorithm among the DL networks that we considered, has 2.2 M parameters. As a result, CNNDF is also the most efficient in terms of the number of trainable parameters and it does not require any data augmentation. The other networks are much larger in size, and they require data augmentation for training. The SVM with the RBF kernel produces inferior results compared with the deep learning based methods and much lower accuracy than that of CNNDF.
The Receiver Operating Characteristic (ROC) curve is a tool to measure the performance of a given ML model. It utilizes two different parameters: True Positive Rate (TPR) and False Positive Rate (FPR). These two parameters are defined as $\mathrm{TPR} = \frac{TP}{TP + FN}$ and $\mathrm{FPR} = \frac{FP}{FP + TN}$, where TP, FP, FN, TN are True Positive, False Positive, False Negative, and True Negative values, respectively. The ROC is a plot of TPR vs FPR at different threshold levels to show how well the model performs the given classification task. A related measure to assess the performance of the model is the Area Under Curve (AUC) derived from the ROC. AUC assumes a value between 0 and 1, where a value close to 1 indicates that a reliable performance is achieved by the model. ROC can be generalized and extended to multiple classes using the "one class vs rest" strategy, where one selected class is labeled as Class 1 while the remaining classes are together labeled as Class 0. This process is repeated for all five classes; therefore, five different ROCs are obtained. Based on these ROCs, the corresponding AUCs are calculated. The ROC plots and the corresponding AUC values are given in Fig 8. It is also worth noting that although our proposed model achieves 84.6% accuracy, the AUC values are all above 0.9. This is due to the adoption of the "one vs rest" methodology. Although the model functions as a multi-class classifier, the ROC curves are obtained by using the two alternate classes created by the "one vs rest" distinction.
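The "one vs rest" ROC construction can be sketched in a few lines. The code below is a minimal illustration, not the evaluation code used in the study: it sweeps the threshold over the observed scores, collects (FPR, TPR) points, and integrates with the trapezoidal rule. The function names and the toy scores are hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR) from the four confusion counts."""
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc_one_vs_rest(scores, labels, positive):
    """ROC points and AUC for one class against all the others.

    scores: model score for the `positive` class, one per sample.
    labels: true class of each sample.
    Requires at least one positive and one negative sample.
    """
    y = [1 if label == positive else 0 for label in labels]
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves along the curve to (1, 1).
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, yi in zip(scores, y) if s >= t and yi == 1)
        fp = sum(1 for s, yi in zip(scores, y) if s >= t and yi == 0)
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc


# Toy example: class "A" is perfectly separated from the rest.
toy_points, toy_auc = roc_auc_one_vs_rest(
    [0.9, 0.8, 0.2, 0.1], ["A", "A", "B", "B"], positive="A")
```

Repeating the call once per class gives the five curves; each one reduces the multi-class problem to a binary one, which is why the AUC values can be high even when the overall multi-class accuracy is lower.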
In addition to AUC values, precision, recall and F1 score values of 5-class classification and 6-class classification are given in Table 3A and 3B, respectively. Table 3A and 3B show that 5-stage classifier is effective in classifying the stages.
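For reference, per-class precision, recall and F1 follow directly from a confusion matrix. The sketch below uses the convention that rows are true classes and columns are predicted classes; the 2×2 matrix is a made-up toy example, not data from Table 3, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1), one tuple per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # rest of row k
        fp = sum(cm[i][k] for i in range(n)) - tp    # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Hypothetical toy matrix for two classes.
toy_metrics = per_class_metrics([[8, 2],
                                 [1, 9]])
```

The same function accepts averaged, non-integer matrices. For instance, the CVMS I row quoted earlier (35.6 of 39 images correct) corresponds to a recall of about 0.913.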
The impact of the directional filters is clearly visible in the fully connected CNN with seven convolutional layers because the recognition accuracy increased from 80.75% to 84.63%. To establish the statistical significance of the improvement due to the use of DF, the p-value is calculated. If the p-value is smaller than the alpha level, a threshold chosen here to be 0.01, we reject the null hypothesis. The null hypothesis (H 0 ) is that the observed accuracy of the fully connected CNN model does not improve with the use of DF, whereas the alternative (H 1 ) is that the CNN model performance improves with the DF. To obtain the p-value, the testing data set is divided into 7 subsets to increase the sample number. Each subset contains 30 images. The image distribution is set up according to the initial density of the classes. A normal distribution is fitted to the subset accuracy histograms for both hypotheses, and its properties are calculated. The means of the normal distributions of the null and alternative hypotheses were computed as 0.778 and 0.826, respectively. The standard deviations of the two distributions are calculated as 0.0762 and 0.0633, respectively, and the p-value is much smaller than the alpha level of 0.01. To conclude, the null hypothesis is rejected with this p-value, i.e., the alternative H 1 , that the CNN model with the directional filters improves performance, is accepted. Given the evidence obtained from the accuracy results and the p-value, we state that the use of the directional filters improved the performance since it provides the model with prior knowledge of the differences among the classes.

IV. Discussion
The most common skeletal maturity indicators are skeletal maturation of hand and wrist and Cervical Vertebrae Maturation (CVM) stages. The hand-wrist method (HWM) has not changed significantly since Greulich and Pyle (GP) published their atlas in 1950. The GP method compares the patient's radiograph with an atlas of representative ages and determines the bone age [38]. Another scoring method, introduced by Tanner-Whitehouse (TW2), is based on examining 20 specific bones [39]. Both methods are tedious and, for orthodontic practice, time-consuming; besides, they require additional radiation exposure [40,41]. Evaluation of CVM stages on the lateral cephalometric radiograph is preferred in clinical practice because it eliminates additional radiography [2]. However, CVM staging methods are influenced by experience and training [42,43]. The AAOF collection was used to develop the CVM classification methods and to study the correlation between CVM development stages and growth. To our knowledge, no previous studies have been published specifically on applying Artificial Intelligence to CVM development staging using the AAOF legacy collection.
With the recent advances in AI, the automated diagnosis and interpretation of medical images are rapidly evolving, aiming to increase accuracy, decrease variability and eliminate the tediousness of the task [44]. Determination of growth and development by AI is a field that needs to be explored. This study aimed to develop a fully automated pipeline for the determination of the CVM stages. There have been a few attempts to develop computer-assisted methods to determine CVM stages; Baptista et al. developed a semi-automated method for CVM in 2011 [18]. They located and measured 20 landmarks on every cephalogram. They constructed three classification models based on Naive Bayes (NB) and achieved 90% with one stage deviation. Their study showed that using a pattern classification method can help orthodontists identify the CVM stage and achieve high accuracy. Amasya et al. compared different ML models on CVM stage classification [19].  [17]. Both studies have concluded that the most stable and accurate performance is achieved when ANN is used as the main model. This points to the benefit of ANNs for the classification task; therefore, using a deep learning model to determine the CVM stage on lateral cephalograms is a reasonable approach.
Until the work of Makaremi et al., the use of a DL model to determine the six CVM stages had not been investigated. They proposed a DL model that determines the cervical vertebrae maturation degree using only an entropy filter [45]. Their model consists of two convolution layers followed by a max-pooling layer, one convolution layer followed by a max-pooling layer, and a fully connected neural network with a hidden layer. The CNN model of [45] achieves an accuracy of 70% on the 6-stage classification on our dataset. We could not make a comparison with the work of Makaremi et al. on their dataset because we do not have access to it. The entropy filter measures attributes of the distribution of pixel values in X-ray images. However, the entropy filters in their model are not trainable. One of the significant aspects of the work by Makaremi et al. is the use of entropy filters to help the DL model capture the context of the lateral cephalograms. That motivated us to examine the use of directional filters, which emphasize the CV edges in the cephalograms, to aid feature selection in the subsequent deeper layers of the CNNDF structure.
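To make the notion of an entropy filter concrete, the following is a minimal sketch of a local Shannon entropy map. It is not the implementation used in [45]; the function name `local_entropy`, the square window, and the `radius` parameter are assumptions chosen for illustration. Each output pixel holds the entropy (in bits) of the pixel values in its neighborhood, so uniform regions map to zero and regions that mix intensities map to larger values.

```python
import math
from collections import Counter


def local_entropy(image, radius=1):
    """Shannon entropy (bits) of the pixel values in a square window.

    Hypothetical helper for illustration only. The window has side
    2 * radius + 1 and is clipped at the image borders.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            n = len(window)
            counts = Counter(window)
            # Entropy of the empirical distribution of values in the window.
            out[r][c] = sum(-(k / n) * math.log2(k / n) for k in counts.values())
    return out


# A uniform patch and a patch containing a vertical intensity step.
flat = [[5] * 4 for _ in range(4)]
edge = [[0, 0, 9, 9] for _ in range(4)]
flat_map = local_entropy(flat)
edge_map_entropy = local_entropy(edge)
```

The filter has no learnable parameters: the window size is fixed in advance, which is the sense in which such a preprocessing step is not trainable.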
Other X-ray image classification examples include [41,46,47]. Lee et al. constructed a fully automated DL system to analyze hand-wrist radiographs and perform Bone Age Assessment (BAA), which indicates growth [41]. They built a complete preprocessing system that categorizes, standardizes, segments, and labels the X-ray images to prepare the data set. The preprocessed and annotated training data set is fed into a variant of the Inception network, GoogLeNet [46]. Lee et al. attempted to create the first fully automated BAA using a DL model. A similar study was carried out by Tajmir et al. [47].
Our model shows high accuracy in classifying the CVM images into 5 classes (CVMS I-CVMS V), higher than the accuracy of classification into 6 stages. Baccetti et al. classified CVM into 5 stages based on the analysis of 6 consecutive observations of untreated subjects [24]; their findings revealed that no statistically significant discrimination exists between CS 1 and CS 2. The presence of a concavity at the lower border of the second cervical vertebra was not a distinctive feature of CVS 2 when compared with CVS 1 [24]. Therefore, they merged the two former prepubertal stages (CS 1 and CS 2) into one stage, referred to as CVMS I. The peak in mandibular growth will occur no earlier than one year after this stage; hence the clinician can wait at least one year for a radiographic re-evaluation before starting treatment with a functional appliance [24]. Both the six-stage and five-stage CVM classification methods showed a high correlation with the ossification stages of the middle phalanx of the third finger [46]. Both classifications are usable in daily clinical practice, as both can discriminate between prepubertal, circumpubertal, and postpubertal stages [2,12,23,24,46,48]. Therefore, BAA can be examined with classification either into 5 classes (CVMS I-CVMS V) or into 6 classes (CS1-CS6), as measured and compared in this paper. Notably, the CNN model with 6 convolutional layers and the directional filters performs well in classification into 5 classes, achieving 84.63% accuracy on the testing data set.
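The relation between the two labeling schemes can be written as a simple lookup. The sketch below merges CS1 and CS2 into CVMS I as described above; the assumption, not stated explicitly in this section, is that the remaining stages keep their order, so that CS3 to CS6 map to CVMS II to CVMS V. The names `SIX_TO_FIVE` and `to_five_stage` are hypothetical.

```python
# Six-stage labels mapped to five-stage labels.
# CS1 and CS2 merge into CVMS I; the order-preserving mapping of the
# remaining stages is an assumption made for this illustration.
SIX_TO_FIVE = {
    "CS1": "CVMS I",
    "CS2": "CVMS I",
    "CS3": "CVMS II",
    "CS4": "CVMS III",
    "CS5": "CVMS IV",
    "CS6": "CVMS V",
}


def to_five_stage(labels):
    """Convert a list of six-stage labels to five-stage labels."""
    return [SIX_TO_FIVE[label] for label in labels]


five_stage_labels = to_five_stage(["CS1", "CS2", "CS4", "CS6"])
```

Under this mapping the five-class problem no longer asks the classifier to separate CS1 from CS2, the pair reported in [24] as not statistically distinguishable.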
The use of directional filters as a preprocessing layer improves accuracy in the CVM classification problem. This approach may not be applicable to an arbitrary object recognition problem, but it is suitable in applications where strengthening edge information aids classification. Our aim is to use our domain knowledge to guide the network by introducing directional filters that emphasize the edges of the cervical vertebrae. Since the differences between CVM stages are determined by the edges and curvatures of the cervical vertebrae, we use the directional-filter-based edge detector layer to improve the performance of the deep neural network. Our CNN with the directional filters (CNNDF) provides better results than pre-trained networks such as Xception. We believe that the strong performance of our method stems from the initial layer of directional filters. We also show that using a trainable preprocessing layer improves accuracy in the CVM image classification task. Another aspect of this work is the number of images in our database. We used a larger number of images than the study of Makaremi et al. [45], and our data consisted of longitudinal consecutive cephalometric observations.
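As a rough illustration of how a bank of directional filters emphasizes edges, the sketch below builds eight 3x3 compass kernels and takes the strongest response at each pixel. The design of the filters in the CNNDF is not given in this section, so the well-known Kirsch compass kernels are swapped in here as a stand-in. In a tunable layer, kernel entries such as these would serve only as initial weights that training then adjusts; that reading is an assumption. All function names are hypothetical.

```python
# Border positions of a 3x3 kernel, listed clockwise from the top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def compass_kernels():
    """Return the eight Kirsch compass kernels (a stand-in filter bank)."""
    base = [5, 5, 5, -3, -3, -3, -3, -3]  # ring values of the "north" kernel
    kernels = []
    for shift in range(8):
        kernel = [[0] * 3 for _ in range(3)]
        for idx, (r, c) in enumerate(RING):
            # Rotate the ring values by `shift` positions (45 degrees each).
            kernel[r][c] = base[(idx - shift) % 8]
        kernels.append(kernel)
    return kernels


def filter_valid(image, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel, without padding."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(3) for j in range(3))
            for c in range(cols - 2)
        ]
        for r in range(rows - 2)
    ]


def edge_map(image):
    """Maximum response over the eight directions at each pixel."""
    responses = [filter_valid(image, k) for k in compass_kernels()]
    rows, cols = len(responses[0]), len(responses[0][0])
    return [
        [max(resp[r][c] for resp in responses) for c in range(cols)]
        for r in range(rows)
    ]


flat_patch = [[7] * 4 for _ in range(4)]
step_patch = [[0, 0, 1, 1] for _ in range(4)]
flat_edges = edge_map(flat_patch)
step_edges = edge_map(step_patch)
```

Each kernel sums to zero, so uniform regions produce no response, while an intensity step produces a strong response in the kernel aligned with it. This is the kind of edge emphasis that a downstream CNN can exploit when stage differences depend on vertebral contours.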

V. Conclusion
This paper presents a deep learning model with novel tunable pre-processing to classify lateral cephalograms into CVM stages. A custom-designed CNNDF model with eight tunable directional filters is introduced. The directional filter layer significantly improved the accuracy of the CNNDF and of the pre-trained networks in the CVM stage classification problem. The p-value calculation confirms the significance of the performance improvement due to the directional filters. Our experimental results show that the proposed CNNDF model performs better than the pre-trained DL models: it produces the best result compared with the pre-trained MobileNetV2, ResNet101, and Xception models, with or without the directional filters. Our method achieved accuracies of 84.63% and 75.1% in the five- and six-class CVM stage classification problems, respectively, on the testing data set. The proposed CNNDF model can be used as an effective tool for determining the skeletal maturity stage and treatment timing, especially for clinicians with less experience; furthermore, it could have multiple forensic applications.